EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora

نویسندگان

  • Michael Beißwenger
  • Sabine Bartsch
  • Stefan Evert
  • Kay-Michael Würzner
چکیده

This paper describes the goals, design and results of a shared task on the automatic linguistic annotation of German language data from genres of computer-mediated communication (CMC), social media interactions and Web corpora. The two subtasks of tokenization and part-of-speech tagging were performed on two data sets: (i) a genuine CMC data set with samples from several CMC genres, and (ii) a Web corpora data set of CC-licensed Web pages which represents the type of data found in large corpora crawled from the Web. The teams participating in the shared task achieved a substantial improvement over current off-the-shelf tools for German. The best tokenizer reached an F1score of 99.57% (vs. 98.95% off-the-shelf baseline), while the best tagger reached an accuracy of 90.44% (vs. 84.86% baseline). The gold standard (more than 20,000 tokens of training and test data) is freely available online together with detailed annotation guidelines. 1 Motivation, premises and goals Over the past decade, there has been a growing interest in collecting, processing and analyzing data from genres of computer-mediated communication and social media interactions (henceforth referred to as CMC) such as chats, blogs, forums, tweets, newsgroups, messaging applications (SMS, WhatsApp), interactions on “social network” sites and on wiki talk pages. The development of resources, tools and best practices for automatic linguistic processing and annotation of CMC discourse has turned out to be a desideratum for several fields of research in the humanities: 1. Large corpora crawled from the Web often contain substantial amounts of CMC (blogs, forums, etc.) and similar forms of noncanonical language. Such data are often regarded as “bycatch” that proves difficult for linguistic annotation by means of standard natural language processing (NLP) tools that are optimized for edited text (Giesbrecht and Evert, 2009). 2. For corpus-based variational linguistics, corpora of CMC discourse are an important resource that closes the “CMC gap” in corpora of contemporary written language and language-in-interaction. With a considerable part of contemporary everyday communication being mediated through CMC technologies, up-to-date investigations of language change and linguistic variation need to be able to include CMC discourse in their empirical analyses. In order to harness the full potential of corpusbased research, the preparation of any type of linguistic corpus which includes CMC discourse— whether a genuine CMC corpus or a broadcoverage Web corpus—faces the challenge of handling and annotating the linguistic peculiarities characteristic for the types of written discourse found in CMC genres. Two fundamental (but nontrivial) tasks are (i) accurate tokenization and (ii) sufficiently reliable part-of-speech (PoS) annotation. Together, they provide a layer of basic linguistic information on the token level that is a pre-

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

bot.zen $@$ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data)

This article describes the system that participated in the Part-of-speech tagging subtask of the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / so-

متن کامل

EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres

We present our system used for the AIPHES team submission in the context of the EmpiriST shared task on “Automatic Linguistic Annotation of ComputerMediated Communication / Social Media”. Our system is based on a rulebased tokenizer and a machine learning sequence labelling POS tagger using a variety of features. We show that the system is robust across the two tested genres: German computer me...

متن کامل

Tags Re-ranking Using Multi-level Features in Automatic Image Annotation

Automatic image annotation is a process in which computer systems automatically assign the textual tags related with visual content to a query image. In most cases, inappropriate tags generated by the users as well as the images without any tags among the challenges available in this field have a negative effect on the query's result. In this paper, a new method is presented for automatic image...

متن کامل

Fuzzy Neighbor Voting for Automatic Image Annotation

With quick development of digital images and the availability of imaging tools, massive amounts of images are created. Therefore, efficient management and suitable retrieval, especially by computers, is one of themost challenging fields in image processing. Automatic image annotation (AIA) or refers to attaching words, keywords or comments to an image or to a selected part of it. In this paper,...

متن کامل

A CAD System Framework for the Automatic Diagnosis and Annotation of Histological and Bone Marrow Images

Due to ever increasing of medical images data in the world’s medical centers and recent developments in hardware and technology of medical imaging, necessity of medical data software analysis is needed. Equipping medical science with intelligent tools in diagnosis and treatment of illnesses has resulted in reduction of physicians’ errors and physical and financial damages. In this article we pr...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016